# Machine Learning as a Tool for Hypothesis Generation

This repository contains a complete step-by-step implementation for the paper Machine Learning as a Tool for Hypothesis
Generation (Ludwig and Mullainathan).


## Availability Statements

We have made as much of the code and data for this project available as possible while preserving the privacy of the 
individuals involved. Even where we are unable to publicly archive all of the raw data elements, we have provided 
complete code for review, along with partial data that allows researchers to replicate many of the cleaning steps, and 
an analysis file that allows researchers to reproduce our final results. (Researchers interested in obtaining the 
redacted individual identifying data should submit a request to the authors directly at jludwig@uchicago.edu.)

### Data

The replication data for this project is found in the `data/raw` directory. A separate [README](data/README.md) 
outlines all of the data used in the project. The original morphs used in the project are included in the 
`data/morphs` directory.

Because of the sensitive nature of much of the data used in this project, we have taken several steps to maintain the
privacy of the individuals involved as much as possible. First, we have masked all identifying variables that are not
required to replicate results. Second, we have hashed ID variables for datasets that are not already available online.
Third, we have chosen to exclude both individually identifying information (names and DOB) and raw mugshots from the 
repository. Because we create our analysis files by linking together different raw datafiles using name and DOB, this 
means that the data we are archiving here allows users to replicate our results from our analysis file but not to 
reconstruct the analysis file from the raw files without this additional individual identifying information. 
Researchers interested in obtaining either this individual identifying information or the raw mugshots should submit a 
request to the authors directly (jludwig@uchicago.edu).

Researchers interested in proceeding from the analysis file should read the initialization instructions below but, 
instead of running scripts `s01` through `s12` of the cleaning subdirectory, should move the file at 
`data/raw/aux/arrest_features-public.feather` to `data/clean/processed/arrest_features.feather` and proceed as normal.
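
The file placement described above can be sketched as follows. This is a minimal sketch that assumes it is run from the repository root. It copies rather than moves the file so that the raw data directory stays intact, which is a small departure from the instruction above (substitute `mv` for `cp` to follow it literally).

```sh
# Assumption: the current working directory is the repository root.
src="data/raw/aux/arrest_features-public.feather"
dst="data/clean/processed/arrest_features.feather"

# Create the destination directory in case it does not exist yet.
mkdir -p "$(dirname "$dst")"

if [ -f "$src" ]; then
    # Copy instead of move so the file in data/raw/aux is left untouched.
    cp "$src" "$dst"
    echo "analysis file placed at $dst"
else
    echo "could not find $src; check the working directory" >&2
fi
```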


### Code

All code required to replicate the project is found in the `src` directory. This includes copies of two external
repositories, which may be included as git submodules or as pre-populated subdirectories. Each of these contains a 
license file and is included here under the terms of its respective license. 


## Requirements

### Cleaning and Analysis

All of the data preparation and analysis code was run on a machine with the following specifications:

* Ubuntu 20.04 operating system,
* 244 GB of memory,
* 256 GB of storage space,
* 32 2.7 GHz Intel Xeon E5 2686 v4 CPUs, and
* 2 NVIDIA Tesla M60 GPUs.

In order to replicate the project environment exactly, users will need to install R version 4.2.2 and Python versions 
3.6.3 and 3.8.16. (We handle multiple installs of Python using [pyenv](https://github.com/pyenv/pyenv).) Instructions
for initializing virtual environments on a fresh machine are provided in a section below. Running the entire project 
will require more than two weeks' compute time on a single machine. Note that we do provide a final analysis file in 
the auxiliary data directory, which can be used to replicate the final results with a much smaller time investment 
(and, using the provided virtual environments, on a machine without any GPUs and much less memory). Some rough 
estimates of the time required to run each step are provided below. 

* Data cleaning can take up to 2 days
* Training the XGBoost models can take several hours
* Training the StyleGAN2 model takes around 5 days (on a different machine)
* Training all three CNN models takes around 4 days (although increasing the number of workers in the relevant
  configuration files can reduce this time, potentially at the cost of reproducibility)
* Producing morphs takes roughly 4 days (although reducing the number of seeds can reduce this time)
* Running the analysis scripts takes a few hours


### Training StyleGAN2

The StyleGAN2 was trained on a separate machine with the following specifications:

* Ubuntu 18.04 operating system,
* 488 GB of memory,
* 1 TB of storage space,
* 64 Intel Skylake CPUs, and
* 8 NVIDIA V100 GPUs.

The original repository provides a [Dockerfile](./src/x03_gan_training/stylegan2_tensorflow/Dockerfile) to help users
replicate the environment used to train the StyleGAN2 model; it manages the installation of all required
dependencies. The [README](./src/x03_gan_training/stylegan2_tensorflow/README.md) provided by the original repository
also gives estimates of the time required to train the StyleGAN2 model; we found that training took approximately
5 days to complete.


## Project Structure

Below is a simplified look at the directory structure for the complete project.

<!-- `tree -L 3 --filelimit 20` # then edited manually -->
```
faceeffect
├── README.md
├── configs
│   └── <8 config YAML files>
├── data
│   ├── README.md
│   ├── clean
│   ├── morphs
│   ├── mugshots
│   └── raw
│       ├── arrest_conviction_matches
│       ├── aux
│       ├── historical_arrests
│       ├── mcso_scraped
│       ├── ncdac
│       ├── ncaoc
│       └── survey_outputs
├── envs
│   ├── <3 Python requirements files>
│   └── renv.lock
├── faceeffect.Rproj
├── models
│   ├── release_baseline
│   ├── release_matched_skintone_wellgroomed
│   ├── skintone_wellgroomed
│   └── stylegan2
├── outputs
├── renv
│   └── activate.R
└── src
    ├── utils
    ├── x01_cleaning
    ├── x02_xgb_training
    ├── x03_gan_training
    │   └── stylegan2_tensorflow
    ├── x04_cnn_training
    ├── x05_morphing
    │   └── stylegan2_pytorch
    └── x06_analysis
```

The following directories are of particular interest:

* `configs`: YAML configurations that control various code steps.
* `data`: data required to reproduce the paper's results.
* `envs`: metadata to create virtual environments for various project stages.
* `models`: the save location for all models trained throughout the project.
* `src`: where all the replication code exists.


### Code

The `src` directory, which contains the replication code, is further subdivided into 7 subdirectories:

* `utils`: 1 script
* `x01_cleaning`: 14 scripts
* `x02_xgb_training`: 2 scripts
* `x03_gan_training`: a single git submodule
* `x04_cnn_training`: 4 scripts
* `x05_morphing`: 5 scripts and a git submodule
* `x06_analysis`: 5 scripts


## Instructions for Replication

### Installing Software

Install the versions of R and Python listed in the Requirements section above.
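
As a quick check after installation, the sketch below reports which of the required interpreters are visible on `PATH`. It assumes the interpreters are exposed under the names `python3.6`, `python3.8`, and `Rscript`; your installation may name them differently. It only prints the reported versions and does not enforce the exact releases.

```sh
checked=0
for tool in python3.6 python3.8 Rscript; do
    if command -v "$tool" >/dev/null 2>&1; then
        # Some tools print their version to stderr, so merge the streams.
        printf '%s: %s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
    else
        echo "$tool: not found on PATH"
    fi
    checked=$((checked + 1))
done
```

Compare the printed versions against R 4.2.2 and Python 3.6.3 and 3.8.16.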


### Populating git submodules

This project loads two external repositories. Depending on how you receive this code, they may come as-is, or be 
handled as git submodules. If you downloaded this codebase as a git repository, or if the submodules appear as empty
directories, you will likely need to initialize them using the shell commands below. This will ensure that 
versions are pinned to the same commits as those used in the project. (There are additional instructions in the
README files of the [GAN training](./src/x03_gan_training/README.md) and 
[morphing](./src/x05_morphing/README.md) subdirectories.)

```sh
git submodule init
git submodule update
```
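
To see whether the submodules still need to be initialized, a minimal check for empty directories might look like the following. The two paths are taken from the links and directory listing in this README and are assumed to be relative to the repository root.

```sh
populated=0
for dir in src/x03_gan_training/stylegan2_tensorflow src/x05_morphing/stylegan2_pytorch; do
    # `ls -A` lists hidden entries too, so an empty result means an empty directory.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir")" ]; then
        echo "$dir: populated"
        populated=$((populated + 1))
    else
        echo "$dir: empty or missing, run the submodule commands above"
    fi
done
echo "$populated of 2 submodule directories populated"
```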


### Creating Virtual Environments

First, initialize an `renv` virtual environment from R using the lockfile as shown. This will activate automatically 
when starting R thanks to the `.Rprofile` and `renv/activate.R` files.

```R
# renv should initialize automatically when R opens; otherwise, install it via `install.packages("renv")`
renv::init(bare = TRUE)
renv::restore(lockfile = "envs/renv.lock")  # takes a bit; time to make some coffee
```

You must then create multiple Python virtual environments as shown. Each is activated manually for the scripts that
need it, as described in the relevant README files.

```sh
# install virtualenv using a fresh pip for versions 3.6.3 and 3.8.16
python3.6 -m pip install --upgrade pip
python3.8 -m pip install --upgrade pip
python3.6 -m pip install virtualenv
python3.8 -m pip install virtualenv

# create three distinct virtual environments
python3.8 -m virtualenv envs/py-cleaning
python3.6 -m virtualenv envs/py-cnn
python3.8 -m virtualenv envs/py-xgb

# install requirements for each environment
envs/py-cleaning/bin/python -m pip install -r envs/requirements-cleaning.txt
envs/py-cnn/bin/python -m pip install -r envs/requirements-cnn.txt
envs/py-xgb/bin/python -m pip install -r envs/requirements-xgb.txt
```
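
Manual activation generally looks like the following. This is a sketch only: the choice of `envs/py-cleaning` is illustrative, and the README in each subdirectory remains the reference for which environment a given script expects.

```sh
env_dir="envs/py-cleaning"  # illustrative choice; swap in py-cnn or py-xgb as needed

if [ -f "$env_dir/bin/activate" ]; then
    . "$env_dir/bin/activate"   # POSIX equivalent of `source`
    python --version            # reports the interpreter the environment was built with
    deactivate                  # leave the environment when finished
else
    echo "$env_dir has not been created yet" >&2
fi
```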

Each of the git submodules requires its own environment; refer to the instructions in the relevant README documents.

### Running the Code

You must step through all code in each of the 6 numbered subdirectories of the `src` directory, in order. Look 
for a README file in each subdirectory with instructions on how to complete these steps. The final outputs will be 
saved in an `outputs` directory.
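
To get an overview of the order in which to work, a small sketch like the one below lists the numbered subdirectories and notes whether each has a README. It assumes the READMEs are named `README.md` (as the linked ones are) and that it is run from the repository root.

```sh
found=0
# The shell expands the glob in sorted order, which follows the numeric prefixes.
for dir in src/x0[1-6]_*/; do
    [ -d "$dir" ] || continue   # the glob stays literal when nothing matches
    found=$((found + 1))
    if [ -f "${dir}README.md" ]; then
        echo "$found. ${dir}README.md"
    else
        echo "$found. $dir (no README.md found)"
    fi
done
echo "$found numbered subdirectories found"
```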


## Questions

For any remaining questions about the code or data, please contact the authors directly (jludwig@uchicago.edu).
